Matrix-vector form of the Bellman equation

\begin{aligned}
v_{π} (s) & = E [R_{t + 1} | S_{t} = s] + γ E [G_{t + 1} | S_{t} = s] \\
& = \underbrace{\sum_{a \in A} π (a | s) \sum_{r \in R} p (r | s, a) r}_{\text{mean of immediate rewards}} + \underbrace{γ \sum_{a \in A} π (a | s) \sum_{s^{'} \in S} p (s^{'} | s, a) v_{π} (s^{'})}_{\text{mean of future rewards}} \\
& = \sum_{a \in A} π (a | s) \left[ \sum_{r \in R} p (r | s, a) r + γ \sum_{s^{'} \in S} p (s^{'} | s, a) v_{π} (s^{'}) \right], \quad \text{for all } s \in S .
\end{aligned} (2.7)

The Bellman equation in (2.7) is in an elementwise form. Since it is valid for every state, we can combine all these equations and write them concisely in a matrix-vector form, which will be used frequently to analyze the Bellman equation.

To derive the matrix-vector form, we first rewrite the Bellman equation in (2.7) as

v_{π} (s) = r_{π} (s) + γ \sum_{s^{'} \in S} p_{π} (s^{'} | s) v_{π} (s^{'}) (2.8)

where $r_{π} (s)$ denotes the mean of the immediate rewards and $p_{π} (s^{'} | s)$ is the probability of transitioning from $s$ to $s^{'}$ under policy $π$. They are defined as

r_{π} (s) ≐ \sum_{a \in A} π (a | s) \sum_{r \in R} p (r | s, a) r,

p_{π} (s^{'} | s) ≐ \sum_{a \in A} π (a | s) p (s^{'} | s, a) .
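The two definitions above are weighted sums over actions, so they can be computed directly from a tabular model. The following is a minimal sketch, assuming the policy and the model are stored as NumPy arrays; the function name, the array layouts, and the two-state toy data are hypothetical and chosen only for illustration.

```python
import numpy as np

def policy_reward_and_transition(pi, p_r, rewards, p_s):
    """Compute r_pi(s) and p_pi(s'|s) from a tabular model (assumed layout).

    pi:      shape (n, m),    pi[s, a]      = pi(a | s)
    p_r:     shape (n, m, k), p_r[s, a, j]  = p(r_j | s, a)
    rewards: shape (k,),      the possible reward values r_j
    p_s:     shape (n, m, n), p_s[s, a, s2] = p(s2 | s, a)
    """
    # Inner sum over rewards: expected_r[s, a] = sum_r p(r | s, a) * r
    expected_r = p_r @ rewards
    # Outer sum over actions: r_pi[s] = sum_a pi(a | s) * expected_r[s, a]
    r_pi = np.sum(pi * expected_r, axis=1)
    # P_pi[s, s2] = sum_a pi(a | s) * p(s2 | s, a)
    P_pi = np.einsum("sa,sat->st", pi, p_s)
    return r_pi, P_pi

# Hypothetical toy model: 2 states, 2 actions, rewards in {0, 1}.
pi = np.array([[0.5, 0.5],
               [1.0, 0.0]])
rewards = np.array([0.0, 1.0])
p_r = np.array([[[1.0, 0.0], [0.0, 1.0]],
                [[0.0, 1.0], [1.0, 0.0]]])
p_s = np.array([[[1.0, 0.0], [0.0, 1.0]],
                [[0.0, 1.0], [1.0, 0.0]]])

r_pi, P_pi = policy_reward_and_transition(pi, p_r, rewards, p_s)
print(r_pi)  # mean immediate reward per state
print(P_pi)  # state-to-state transition probabilities under pi
```

For this toy model, $r_{π}$ evaluates to $[0.5, 1]^{T}$, and the rows of the resulting matrix are $[0.5, 0.5]$ and $[0, 1]$.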

Suppose that the states are indexed as $s_{i}$ with $i = 1, \dots, n$ , where $n = | S |$ . For state $s_{i}$ , (2.8) can be written as

v_{π} (s_{i}) = r_{π} (s_{i}) + γ \sum_{s_{j} \in S} p_{π} (s_{j} | s_{i}) v_{π} (s_{j}) . (2.9)

Let $v_{π} = [v_{π} (s_{1}), \dots, v_{π} (s_{n})]^{T} \in R^{n}$ , $r_{π} = [r_{π} (s_{1}), \dots, r_{π} (s_{n})]^{T} \in R^{n}$ , and $P_{π} \in R^{n \times n}$ with $[P_{π}]_{i j} = p_{π} (s_{j} | s_{i})$ . Then, (2.9) can be written in the following matrix-vector form:

v_{π} = r_{π} + γ P_{π} v_{π}, (2.10)

where $v_{π}$ is the unknown to be solved, and $r_{π}, P_{π}$ are known.
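To see that (2.10) is only a stacked version of (2.9), the right-hand sides of both can be evaluated for an arbitrary vector and compared. The sketch below assumes a randomly generated row-stochastic matrix, a random reward vector, and a discount rate of 0.9; none of these values come from the text, and the vector `v` is arbitrary rather than the true state value.

```python
import numpy as np

gamma = 0.9                      # assumed discount rate
n = 4                            # assumed number of states
rng = np.random.default_rng(0)   # fixed seed for reproducibility

# Hypothetical model: random nonnegative rows normalized to sum to one.
P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)
r_pi = rng.random(n)
v = rng.random(n)                # arbitrary vector, not necessarily v_pi

# Right-hand side of (2.9), evaluated one state at a time.
rhs_elementwise = np.array([
    r_pi[i] + gamma * sum(P_pi[i, j] * v[j] for j in range(n))
    for i in range(n)
])

# Right-hand side of (2.10), evaluated in one matrix-vector product.
rhs_matrix = r_pi + gamma * P_pi @ v

print(np.allclose(rhs_elementwise, rhs_matrix))
```

The two results agree up to floating-point rounding because row $i$ of $P_{π} v_{π}$ is exactly the sum over $s_{j}$ in (2.9).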

The matrix $P_{π}$ has two notable properties.

First, it is a nonnegative matrix, meaning that all its elements are equal to or greater than zero. This property is denoted as $P_{π} \geq 0$ , where 0 denotes a zero matrix with appropriate dimensions. In this book, $\geq$ or $\leq$ represents an elementwise comparison operation.
Second, $P_{π}$ is a stochastic matrix, meaning that the sum of the values in every row is equal to one. This property is denoted as $P_{π} 1 = 1$ , where $1 = [1, \dots, 1]^{T}$ has appropriate dimensions.
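Both properties are easy to test numerically. The following is a minimal sketch; the helper name `is_stochastic` and the tolerance are assumptions made for illustration, and the check uses the same condition $P_{π} 1 = 1$ stated above.

```python
import numpy as np

def is_stochastic(P, tol=1e-12):
    """Return True if P is elementwise nonnegative and satisfies P 1 = 1."""
    P = np.asarray(P, dtype=float)
    nonnegative = bool(np.all(P >= 0))           # P >= 0, elementwise
    ones = np.ones(P.shape[1])
    rows_sum_to_one = bool(
        np.allclose(P @ ones, np.ones(P.shape[0]), atol=tol)  # P 1 = 1
    )
    return nonnegative and rows_sum_to_one

valid = is_stochastic([[0.5, 0.5], [0.0, 1.0]])      # rows sum to one
bad_rows = is_stochastic([[0.5, 0.6], [0.0, 1.0]])   # first row sums to 1.1
negative = is_stochastic([[1.5, -0.5], [0.0, 1.0]])  # has a negative entry
print(valid, bad_rows, negative)
```

The third case shows why both conditions are needed: its rows sum to one, but it is not a nonnegative matrix.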

Consider the example shown in Figure 2.6. The matrix-vector form of the Bellman equation is

\begin{bmatrix} v_{π} (s_{1}) \\ v_{π} (s_{2}) \\ v_{π} (s_{3}) \\ v_{π} (s_{4}) \end{bmatrix} = \begin{bmatrix} r_{π} (s_{1}) \\ r_{π} (s_{2}) \\ r_{π} (s_{3}) \\ r_{π} (s_{4}) \end{bmatrix} + γ \begin{bmatrix} p_{π} (s_{1} | s_{1}) & p_{π} (s_{2} | s_{1}) & p_{π} (s_{3} | s_{1}) & p_{π} (s_{4} | s_{1}) \\ p_{π} (s_{1} | s_{2}) & p_{π} (s_{2} | s_{2}) & p_{π} (s_{3} | s_{2}) & p_{π} (s_{4} | s_{2}) \\ p_{π} (s_{1} | s_{3}) & p_{π} (s_{2} | s_{3}) & p_{π} (s_{3} | s_{3}) & p_{π} (s_{4} | s_{3}) \\ p_{π} (s_{1} | s_{4}) & p_{π} (s_{2} | s_{4}) & p_{π} (s_{3} | s_{4}) & p_{π} (s_{4} | s_{4}) \end{bmatrix} \begin{bmatrix} v_{π} (s_{1}) \\ v_{π} (s_{2}) \\ v_{π} (s_{3}) \\ v_{π} (s_{4}) \end{bmatrix} .

Substituting the specific values into the above equation gives

\begin{bmatrix} v_{π} (s_{1}) \\ v_{π} (s_{2}) \\ v_{π} (s_{3}) \\ v_{π} (s_{4}) \end{bmatrix} = \begin{bmatrix} 0.5 (0) + 0.5 (- 1) \\ 1 \\ 1 \\ 1 \end{bmatrix} + γ \begin{bmatrix} 0 & 0.5 & 0.5 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} v_{π} (s_{1}) \\ v_{π} (s_{2}) \\ v_{π} (s_{3}) \\ v_{π} (s_{4}) \end{bmatrix} .

It can be seen that $P_{π}$ satisfies $P_{π} 1 = 1$ .
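The numbers in this example can be checked with a few lines of code. The sketch below copies $r_{π}$ and $P_{π}$ from the equation above, verifies $P_{π} 1 = 1$, and then solves for $v_{π}$ by rearranging (2.10) into the linear system $(I - γ P_{π}) v_{π} = r_{π}$. The rearrangement is a standard linear-algebra step rather than something derived in this passage, and the value $γ = 0.9$ is an assumption, since the text leaves $γ$ symbolic.

```python
import numpy as np

gamma = 0.9  # assumed discount rate; the text keeps gamma symbolic

# r_pi and P_pi copied from the substituted equation for Figure 2.6.
r_pi = np.array([0.5 * 0 + 0.5 * (-1), 1.0, 1.0, 1.0])
P_pi = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
])

# Stochastic-matrix check: P_pi 1 = 1.
row_sums = P_pi @ np.ones(4)

# Solve (I - gamma * P_pi) v_pi = r_pi for the unknown v_pi.
v_pi = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)

# Residual of (2.10); it should vanish up to rounding.
residual = v_pi - (r_pi + gamma * P_pi @ v_pi)

print(row_sums)
print(v_pi)      # approximately [8.5, 10, 10, 10] for gamma = 0.9
print(residual)
```

With $γ = 0.9$, the last row gives $v_{π} (s_{4}) = 1 + 0.9 v_{π} (s_{4})$, so $v_{π} (s_{4}) = 10$; substituting upward yields $v_{π} (s_{2}) = v_{π} (s_{3}) = 10$ and $v_{π} (s_{1}) = - 0.5 + 0.9 (10) = 8.5$.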

Figure 2.6: An example for demonstrating the matrix-vector form of the Bellman equation.